Introduction

The Simpsons holds the title of the world’s longest-running animated sitcom. It was created by the American writer Matt Groening in 1989 and is currently in its 31st season. The show gives a satirical depiction of the working-class life of the Simpson family, which consists of Homer, Marge, Bart, Lisa, and baby Maggie.

Several resources about the show are available on the data science platform Kaggle and in the #tidytuesday GitHub repository. Let us explore these data sets and see which insights we can get from them. We start by loading the packages required for the analysis.

The five data sets we will be working with are simpsons_characters.csv, simpsons_locations.csv, simpsons_script_lines.csv, simpsons_episodes.csv, and simpsons-guests.csv.
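
As a minimal sketch of this loading step, the snippet below builds the paths to the five files and reads whichever ones are present. The folder name data_dir is a hypothetical placeholder, and base read.csv with default settings is an assumption: the delimiter of each file should be checked before relying on it.

```r
# Hypothetical folder holding the five CSV files; adjust to your setup.
data_dir <- "data"

files <- c(
  characters = "simpsons_characters.csv",
  locations  = "simpsons_locations.csv",
  dialogues  = "simpsons_script_lines.csv",
  episodes   = "simpsons_episodes.csv",
  guests     = "simpsons-guests.csv"
)

# Build full paths while keeping the short names used in the text.
paths <- setNames(file.path(data_dir, files), names(files))

# Read only the files that actually exist, so the sketch runs anywhere.
# Assumption: comma-separated files; check each file's delimiter first.
datasets <- lapply(paths[file.exists(paths)], read.csv, stringsAsFactors = FALSE)
```

Each element of datasets can then be accessed by its short name, for instance datasets$episodes.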

 

Pre-processing

Before starting off with the actual analysis, it is convenient to perform some pre-processing operations.

  • characters data set: we recode the levels of gender as Male, Female, and Unknown.
  • dialogues data set: we retain the speaking lines and create identifiers for the row number.
  • episodes data set: we create a variable denoting the episode number. Its format follows the rules described here (i.e., the first number refers to the order in which the episode aired during the entire series, and the second one to the season followed by the episode number within that season). We also keep only the episodes up to season 27, since the information about season 28 is partial (there are only 4 episodes).
  • guests data set: we exclude the Movie season entries and create a logical variable self, which is equal to TRUE if a given guest star has played themselves in a particular episode, and FALSE if they have voiced a regular character.
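
The four steps above can be sketched with dplyr on tiny stand-in tables. The raw column names (speaking_line, number_in_season), the raw gender codes ("m", "f", missing), and the rule used to detect guests playing themselves are all assumptions made for illustration, not necessarily what the original analysis does.

```r
library(dplyr)

# Toy stand-ins for the raw tables; column names and codings are assumptions.
characters_raw <- data.frame(name = c("Homer Simpson", "Marge Simpson", "Children"),
                             gender = c("m", "f", NA), stringsAsFactors = FALSE)
dialogues_raw <- data.frame(episode_id = c(32L, 32L, 32L),
                            speaking_line = c(TRUE, FALSE, TRUE),
                            line = c("Where's Mr. Bergstrom?", NA,
                                     "That life is worth living."),
                            stringsAsFactors = FALSE)
episodes_raw <- data.frame(id = c(7L, 14L, 600L), season = c(1L, 2L, 28L),
                           number_in_season = c(7L, 1L, 4L))
guests_raw <- data.frame(season = c("2", "2", "Movie"),
                         guest_star = c("Tony Bennett", "Daryl Coley", "Hypothetical Guest"),
                         role = c("Himself", "Bleeding Gums Murphy", "Himself"),
                         stringsAsFactors = FALSE)

# characters: recode gender into Male, Female, and Unknown
characters <- characters_raw %>%
  mutate(gender = case_when(gender == "m" ~ "Male",
                            gender == "f" ~ "Female",
                            TRUE ~ "Unknown"))

# dialogues: keep the speaking lines and number the remaining rows
dialogues <- dialogues_raw %>%
  filter(speaking_line) %>%
  mutate(line_number = row_number())

# episodes: drop season 28 and build the "007–107" style episode number
episodes <- episodes_raw %>%
  filter(season <= 27) %>%
  mutate(number = sprintf("%03d\u2013%d%02d", id, season, number_in_season))

# guests: drop the Movie entries and flag guests playing themselves
guests <- guests_raw %>%
  filter(season != "Movie") %>%
  mutate(self = role %in% c("Himself", "Herself", "Themselves"))
```

The sprintf pattern reproduces the format seen in the tables below, for example 007–107 for the seventh episode of season 1.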

Let us have a look at the pre-processed data sets characters, locations, episodes, dialogues, and guests.

Characters

Characters data set (first 50 rows)
id name normalized_name gender
7 Children children Unknown
12 Mechanical Santa mechanical santa Unknown
13 Tattoo Man tattoo man Unknown
16 DOCTOR ZITSOFSKY doctor zitsofsky Unknown
20 Students students Unknown
24 Little Boy little boy Unknown
26 Lewis Clark lewis clark Unknown
27 Little Girl little girl Unknown
29 Bubbles bubbles Unknown
30 Moldy moldy Unknown
34 Ticket Seller ticket seller Unknown
35 Elf #1 elf 1 Unknown
36 Elves elves Unknown
37 Dog’s Owner dogs owner Unknown
39 Kids kids Unknown
41 Conductor conductor Unknown
42 Secretary secretary Unknown
46 Sydney sydney Unknown
47 Cecile Shapiro cecile shapiro Unknown
48 Ian ian Unknown
49 Calvin calvin Unknown
50 Martin Prince, Sr.  martin prince sr Unknown
51 Richard richard Unknown
53 Wendell Borton wendell borton Unknown
57 Smilin’ Joe Fission smilin joe fission Unknown
58 Rod #1 rod 1 Unknown
59 Rod #2 rod 2 Unknown
60 RODS rods Unknown
61 Workman #1 workman 1 Unknown
62 Foreman foreman Unknown
63 TERRI & SHERRI terri sherri Unknown
64 PUNK TEENAGER punk teenager Unknown
65 Tv Announcer #1 tv announcer 1 Unknown
66 Tv Announcer #2 tv announcer 2 Unknown
67 Jingle Chorus jingle chorus Unknown
68 Sylvia Winfield sylvia winfield Unknown
69 Old Man Winfield old man winfield Unknown
70 Councilman #1 councilman 1 Unknown
72 Councilman #2 councilman 2 Unknown
73 COUNCILMEN #1/#2 councilmen 12 Unknown
74 Demonstrator #1 demonstrator 1 Unknown
75 Crowd crowd Unknown
76 MR. GAMMILL mr gammill Unknown
77 TOM tom Unknown
78 Mrs. Long mrs long Unknown
79 Wife #1 wife 1 Unknown
80 Wife #2 wife 2 Unknown
81 Other Women other women Unknown
82 Nice Father nice father Unknown
83 Nice Boy nice boy Unknown

Locations

Location data set (first 50 rows)
id name normalized_name
1 Street street
2 Car car
3 Springfield Elementary School springfield elementary school
4 Auditorium auditorium
5 Simpson Home simpson home
6 KITCHEN kitchen
7 SHOPPING MALL PARKING LOT shopping mall parking lot
8 Springfield Mall springfield mall
9 The Happy Sailor Tattoo Parlor the happy sailor tattoo parlor
10 Springfield Nuclear Power Plant springfield nuclear power plant
11 PLANT plant
12 DERMATOLOGY CLINIC dermatology clinic
13 Laboratory laboratory
14 Circus of Values circus of values
15 Moe’s Tavern moe tavern
16 Santa School santa school
17 Santa’s Workshop santa workshop
18 WORKSHOP workshop
19 PERSONNEL OFFICE personnel office
20 Springfield Downs Dog Track springfield downs dog track
21 SPRINGFIELD DOWNS springfield downs
22 PADDOCK paddock
23 SPRINGFIELD DOWN springfield down
24 SPRINGFIELD DOWNS PARKING LOT springfield downs parking lot
25 Simpson Living Room simpson living room
26 Springfield Elementary School Playground springfield elementary school playground
27 CLASSROOM classroom
28 Skinner’s Office skinner office
29 Homer’s Car homer car
30 NEW SCHOOL new school
31 Opera House opera house
32 OLD SCHOOL old school
33 NEW CLASSROOM new classroom
34 SCHOOL BUILDING school building
35 Simpson Back Porch simpson back porch
36 Bus bus
37 Road road
38 Conference Room conference room
39 COFFEE ROOM coffee room
40 Bar bar
41 Berger’s Burgers berger burgers
42 REFRIGERATOR refrigerator
43 Bart’s Bedroom bart bedroom
44 Simpson Backyard simpson backyard
45 Simpson Neighborhood simpson neighborhood
46 Master Bedroom master bedroom
47 LIVING ROOM living room
48 Springfield Town Hall springfield town hall
49 CITY COUNCIL CHAMBERS city council chambers
50 Park park

Episodes

Episodes data set (first 50 rows)
season number episode_id prod_code year title rating votes views us_views
1 010–110 10 7G10 1,990 Homer’s Night Out 7.4 1,511 50,816 30.3
1 012–112 12 7G12 1,990 Krusty Gets Busted 8.3 1,716 62,561 30.4
2 014–201 14 7F03 1,990 Bart Gets an “F” 8.2 1,638 59,575 33.6
2 017–204 17 7F01 1,990 Two Cars in Every Garage and Three Eyes on Every Fish 8.1 1,457 64,959 26.1
2 019–206 19 7F08 1,990 Dead Putting Society 8.0 1,366 50,691 25.4
2 021–208 21 7F06 1,990 Bart the Daredevil 8.4 1,522 57,605 26.2
2 023–210 23 7F10 1,991 Bart Gets Hit by a Car 7.8 1,340 56,486 24.8
2 026–213 26 7F13 1,991 Homer vs. Lisa and the 8th Commandment 8.0 1,329 58,277 26.2
2 028–215 28 7F16 1,991 Oh Brother, Where Art Thou? 8.2 1,413 47,426 26.8
2 030–217 30 7F17 1,991 Old Money 7.6 1,243 44,331 21.2
2 032–219 32 7F19 1,991 Lisa’s Substitute 8.5 1,684 52,770 17.7
2 035–222 35 7F22 1,991 Blood Feud 8.0 1,223 52,829 17.3
3 037–302 37 8F01 1,991 Mr. Lisa Goes to Washington 7.7 1,274 52,098 20.2
3 039–304 39 8F03 1,991 Bart the Murderer 8.7 1,446 64,342 20.8
3 041–306 41 8F05 1,991 Like Father, Like Clown 7.7 1,262 45,586 20.2
3 044–309 44 8F07 1,991 Saturdays of Thunder 7.9 1,194 55,808 24.7
3 046–311 46 8F09 1,991 Burns Verkaufen der Kraftwerk 8.2 1,291 55,987 21.1
3 048–313 48 8F11 1,992 Radio Bart 8.5 1,365 58,919 24.2
3 051–316 51 8F16 1,992 Bart the Lover 8.3 1,272 53,123 20.5
3 053–318 53 8F15 1,992 Separate Vocations 8.2 1,201 61,508 23.7
3 055–320 55 8F19 1,992 Colonel Homer 7.9 1,233 46,901 25.5
3 058–323 58 8F22 1,992 Bart’s Friend Falls in Love 7.8 1,160 48,058 19.5
4 060–401 60 8F24 1,992 Kamp Krusty 8.4 1,414 67,081 21.8
4 065–406 65 9F03 1,992 Itchy & Scratchy: The Movie 8.2 1,293 55,740 20.1
4 069–410 69 9F08 1,992 Lisa’s First Word 8.5 1,350 62,070 28.6
4 072–413 72 9F11 1,993 Selma’s Choice 8.0 1,153 56,396 24.5
1 007–107 7 7G09 1,990 The Call of the Simpsons 7.9 1,638 57,793 27.6
2 024–211 24 7F12 1,991 One Fish, Two Fish, Blowfish, Blue Fish 8.8 1,687 50,206 24.2
4 080–421 80 9F20 1,993 Marge in Chains 7.7 1,080 68,692 17.3
5 082–501 82 9F21 1,993 Homer’s Barbershop Quartet 8.4 1,416 58,390 19.9
5 084–503 84 1F02 1,993 Homer Goes to College 8.6 1,476 64,802 18.1
5 087–506 87 1F03 1,993 Marge on the Lam 8.0 1,132 53,490 21.7
5 089–508 89 1F06 1,993 Boy-Scoutz ’n the Hood 8.7 1,270 83,238 20.1
5 092–511 92 1F09 1,994 Homer the Vigilante 8.2 1,202 74,673 20.1
5 093–512 93 1F11 1,994 Bart Gets Famous 8.1 1,123 66,267 20.0
5 095–514 95 1F12 1,994 Lisa vs. Malibu Stacy 8.2 1,187 61,715 20.5
5 098–517 98 1F15 1,994 Bart Gets an Elephant 7.9 1,116 63,427 17.0
5 102–521 102 1F21 1,994 Lady Bouvier’s Lover 7.5 1,014 59,503 15.1
6 104–601 104 1F22 1,994 Bart of Darkness 8.6 1,330 65,126 15.1
6 107–604 107 2F01 1,994 Itchy & Scratchy Land 8.5 1,277 72,722 14.8
6 111–608 111 2F05 1,994 Lisa on Ice 8.4 1,236 63,564 17.9
6 114–611 114 2F08 1,994 Fear of Flying 7.8 1,100 61,569 15.6
6 116–613 116 2F10 1,995 And Maggie Makes Three 8.5 1,284 63,051 17.3
6 118–615 118 2F12 1,995 Homie the Clown 8.5 1,254 73,123 17.6
6 120–617 120 2F14 1,995 Homer vs. Patty and Selma 7.9 1,006 60,599 18.9
6 123–620 123 2F18 1,995 Two Dozen and One Greyhounds 8.1 1,051 62,323 11.6
6 125–622 125 2F32 1,995 ’Round Springfield 8.3 1,084 56,001 12.6
6 127–624 127 2F22 1,995 Lemon of Troy 8.6 1,285 70,698 13.1
7 130–702 130 2F17 1,995 Radioactive Man 8.3 1,172 62,390 15.7
7 132–704 132 3F02 1,995 Bart Sells His Soul 8.7 1,354 65,333 14.8

Dialogues

Dialogues data set (first 50 rows)
line_number episode_id number role location line
1 32 209 Miss Hoover Springfield Elementary School No, actually, it was a little of both. Sometimes when a disease is in all the magazines and all the news shows, it’s only natural that you think you have it.
2 32 210 Lisa Simpson Springfield Elementary School Where’s Mr. Bergstrom?
3 32 211 Miss Hoover Springfield Elementary School I don’t know. Although I’d sure like to talk to him. He didn’t touch my lesson plan. What did he teach you?
4 32 212 Lisa Simpson Springfield Elementary School That life is worth living.
5 32 213 Edna Krabappel-Flanders Springfield Elementary School The polls will be open from now until the end of recess. Now, just in case any of you have decided to put any thought into this, we’ll have our final statements. Martin?
6 32 214 Martin Prince Springfield Elementary School I don’t think there’s anything left to say.
7 32 215 Edna Krabappel-Flanders Springfield Elementary School Bart?
8 32 216 Bart Simpson Springfield Elementary School Victory party under the slide!
9 32 218 Lisa Simpson Apartment Building Mr. Bergstrom! Mr. Bergstrom!
10 32 219 Landlady Apartment Building Hey, hey, he Moved out this morning. He must have a new job – he took his Copernicus costume.
11 32 220 Lisa Simpson Apartment Building Do you know where I could find him?
12 32 221 Landlady Apartment Building I think he’s taking the next train to Capital City.
13 32 222 Lisa Simpson Apartment Building The train, how like him… traditional, yet environmentally sound.
14 32 223 Landlady Apartment Building Yes, and it’s been the backbone of our country since Leland Stanford drove that golden spike at Promontory point.
15 32 224 Lisa Simpson Apartment Building I see he touched you, too.
16 32 226 Bart Simpson Springfield Elementary School Hey, thanks for your vote, man.
17 32 227 Nelson Muntz Springfield Elementary School I didn’t vote. Voting’s for geeks.
18 32 228 Bart Simpson Springfield Elementary School Well, you got that right. Thanks for your vote, girls.
19 32 229 Terri/sherri Springfield Elementary School We forgot.
20 32 230 Bart Simpson Springfield Elementary School Well, don’t sweat it. Just so long as a couple of people did… right, Milhouse?
21 32 231 Milhouse Van Houten Springfield Elementary School Uh oh.
22 32 232 Bart Simpson Springfield Elementary School Lewis?
23 32 233 Bart Simpson Springfield Elementary School Somebody must have voted.
24 32 234 Milhouse Van Houten Springfield Elementary School What about you, Bart? Didn’t you vote?
25 32 235 Bart Simpson Springfield Elementary School Uh oh.
26 32 237 Wendell Borton Springfield Elementary School Yayyyyyyyyyyyyyy!
27 32 238 Bart Simpson Springfield Elementary School I demand a recount.
28 32 239 Edna Krabappel-Flanders Springfield Elementary School One for Martin, two for Martin. Would you like another recount?
29 32 240 Bart Simpson Springfield Elementary School No. 
30 32 241 Edna Krabappel-Flanders Springfield Elementary School Well, I just want to make sure. One for Martin. Two for Martin.
31 32 242 Kid Reporter Springfield Elementary School This way, Mister President!
32 32 244 Conductor Train Station Now boarding on track 5, The afternoon delight coming to Shelbyville, Parkville, and…..
33 32 245 Lisa Simpson Train Station Mr. Bergstrom! Hey, Mr. Bergstrom!
34 32 246 BERGSTROM Train Station Hey, Lisa.
35 32 247 Lisa Simpson Train Station Hey, Lisa, indeed.
36 32 248 BERGSTROM Train Station What? What is it?
37 32 249 Lisa Simpson Train Station Oh, I mean, were you just going to leave, just like that?
38 32 250 BERGSTROM Train Station Ah, I’m sorry, Lisa. You know, it’s the life of the substitute teacher: he’s a fraud. Today he might be wearing gym shorts, tomorrow he’s speaking French, or, or, or pretending to know how to run a band saw, or God knows what.
39 32 251 Lisa Simpson Train Station You can’t go! You’re the best teacher I’ll ever have.
40 32 252 BERGSTROM Train Station Ah, that’s not true. Other teachers will come along who…
41 32 253 Lisa Simpson Train Station Oh, please.
42 32 254 BERGSTROM Train Station No, I can’t lie to you, I am the best. But, you know, they need me over in the projects of Capital City.
43 32 255 Lisa Simpson Train Station But I need you too.
44 32 256 BERGSTROM Train Station That’s the problem with being middle class. Anybody who really cares will abandon you for those who need it more.
45 32 257 Lisa Simpson Train Station I, I understand. Mr. Bergstrom, I’m going to miss you.
46 32 258 BERGSTROM Train Station I’ll tell you what…
47 32 259 BERGSTROM Train Station Whenever you feel like you’re alone and there’s nobody you can rely on, this is all you need to know.
48 32 260 Lisa Simpson Train Station Thank you, Mr. Bergstrom.
49 32 261 Conductor Train Station All aboard!
50 32 262 Lisa Simpson Train Station So, I guess this is it? It you don’t mind I’ll just run alongside the train as it speeds you from my life?

Guests

Guests data set (first 50 rows)
season number prod_code title guest_star role self
1 002–102 7G02 Bart the Genius Marcia Wallace Edna Krabappel-Flanders FALSE
1 002–102 7G02 Bart the Genius Marcia Wallace Ms. Melon FALSE
1 003–103 7G03 Homer’s Odyssey Sam McMurray Worker FALSE
1 003–103 7G03 Homer’s Odyssey Marcia Wallace Edna Krabappel-Flanders FALSE
1 006–106 7G06 Moaning Lisa Miriam Flynn Ms. Barr FALSE
1 006–106 7G06 Moaning Lisa Ron Taylor Bleeding Gums Murphy FALSE
1 007–107 7G09 The Call of the Simpsons Albert Brooks Cowboy Bob FALSE
1 008–108 7G07 The Telltale Head Marcia Wallace Edna Krabappel-Flanders FALSE
1 009–109 7G11 Life on the Fast Lane Albert Brooks Jacques FALSE
1 010–110 7G10 Homer’s Night Out Sam McMurray Gulliver Dark FALSE
1 011–111 7G13 The Crepes of Wrath Christian Coffinet Gendarme Officer FALSE
1 012–112 7G12 Krusty Gets Busted Kelsey Grammer Sideshow Bob FALSE
1 013–113 7G01 Some Enchanted Evening June Foray Babysitter service receptionist FALSE
1 013–113 7G01 Some Enchanted Evening June Foray Doofy the Elf FALSE
1 013–113 7G01 Some Enchanted Evening Penny Marshall Ms. Botz FALSE
1 013–113 7G01 Some Enchanted Evening Penny Marshall Lucille Botzcowski FALSE
1 013–113 7G01 Some Enchanted Evening Paul Willson Florist FALSE
2 014–201 7F03 Bart Gets an “F” Marcia Wallace Edna Krabappel-Flanders FALSE
2 015–202 7F02 Simpson and Delilah Harvey Fierstein Karl FALSE
2 016–203 7F04 Treehouse of Horror James Earl Jones Removal man FALSE
2 016–203 7F04 Treehouse of Horror James Earl Jones Serak the Preparer FALSE
2 016–203 7F04 Treehouse of Horror James Earl Jones Narrator FALSE
2 018–205 7F05 Dancin’ Homer Tony Bennett Himself TRUE
2 018–205 7F05 Dancin’ Homer Daryl Coley Bleeding Gums Murphy FALSE
2 018–205 7F05 Dancin’ Homer Ken Levine Dan Horde FALSE
2 018–205 7F05 Dancin’ Homer Tom Poston Capital City Goofball FALSE
2 020–207 7F07 Bart vs. Thanksgiving Greg Berg Rory FALSE
2 020–207 7F07 Bart vs. Thanksgiving Greg Berg Eddie FALSE
2 020–207 7F07 Bart vs. Thanksgiving Greg Berg Radio voice FALSE
2 020–207 7F07 Bart vs. Thanksgiving Greg Berg “Hooray for Everything” Announcer FALSE
2 020–207 7F07 Bart vs. Thanksgiving Greg Berg Security Man FALSE
2 020–207 7F07 Bart vs. Thanksgiving Carol Kane Maggie Simpson FALSE
2 022–209 7F09 Itchy & Scratchy & Marge Alex Rocco Roger Meyers Jr.  FALSE
2 023–210 7F10 Bart Gets Hit by a Car Phil Hartman Lionel Hutz FALSE
2 023–210 7F10 Bart Gets Hit by a Car Phil Hartman Heaven FALSE
2 024–211 7F11 One Fish, Two Fish, Blowfish, Blue Fish Larry King Himself TRUE
2 024–211 7F11 One Fish, Two Fish, Blowfish, Blue Fish Joey Miyashima Toshiro FALSE
2 024–211 7F11 One Fish, Two Fish, Blowfish, Blue Fish Sab Shimono Master Sushi Chef FALSE
2 024–211 7F11 One Fish, Two Fish, Blowfish, Blue Fish George Takei Akira FALSE
2 024–211 7F11 One Fish, Two Fish, Blowfish, Blue Fish Diana Tanaka Hostess FALSE
2 025–212 7F12 The Way We Was Jon Lovitz Artie Ziff FALSE
2 025–212 7F12 The Way We Was Jon Lovitz Mr. Seckofsky FALSE
2 026–213 7F13 Homer vs. Lisa and the 8th Commandment Phil Hartman Troy McClure FALSE
2 026–213 7F13 Homer vs. Lisa and the 8th Commandment Phil Hartman Moses FALSE
2 026–213 7F13 Homer vs. Lisa and the 8th Commandment Phil Hartman Cable guy FALSE
2 027–214 7F15 Principal Charming Marcia Wallace Edna Krabappel-Flanders FALSE
2 028–215 7F16 Oh Brother, Where Art Thou? Danny DeVito Herbert Powell FALSE
2 029–216 7F14 Bart’s Dog Gets an F Tracey Ullman Emily Winthropp FALSE
2 029–216 7F14 Bart’s Dog Gets an F Tracey Ullman Sylvia Winfield FALSE
2 029–216 7F14 Bart’s Dog Gets an F Frank Welker Santa’s Little Helper FALSE

Let us join dialogues with the information contained in characters and episodes. This allows us to know the gender of the most talkative characters and the season of each episode. Notice that dialogues ends at episode 568 (episode 16 of season 26), whereas episodes ends at episode 596 (the last episode of season 27).

Let us also join episodes with information from guests. This will be useful to get, for instance, the guest star names for each episode (if any), and the roles they played. Note that guests terminates at episode 662 (episode 23 of season 30).
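
A minimal sketch of the two joins, assuming dplyr and toy versions of the pre-processed tables. The join keys (role matched to name, episode_id, and number) are assumptions based on the columns shown in the tables above, and "Guest A" is a placeholder name.

```r
library(dplyr)

# Toy versions of the pre-processed tables (values are illustrative).
characters <- data.frame(name = c("Lisa Simpson", "Bart Simpson"),
                         gender = c("Female", "Male"), stringsAsFactors = FALSE)
episodes <- data.frame(episode_id = 32L, season = 2L, number = "032\u2013219",
                       stringsAsFactors = FALSE)
dialogues <- data.frame(episode_id = c(32L, 32L, 32L),
                        role = c("Lisa Simpson", "Bart Simpson", "Landlady"),
                        stringsAsFactors = FALSE)
guests <- data.frame(number = "032\u2013219", guest_star = "Guest A",
                     self = FALSE, stringsAsFactors = FALSE)

# dialogues + gender of the speaker + season of the episode
dialogues_full <- dialogues %>%
  left_join(characters, by = c("role" = "name")) %>%
  left_join(select(episodes, episode_id, season), by = "episode_id") %>%
  mutate(gender = coalesce(gender, "Unknown"))  # unmatched speakers

# episodes + guest stars (episodes without guests keep NA in guest columns)
episodes_guests <- episodes %>%
  left_join(guests, by = "number")
```

A left join keeps every dialogue line and every episode, even when no matching character or guest star exists.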

Lastly, we tidy dialogues, which conveniently puts the speaking lines into a one-word-per-row format.
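
The one-word-per-row format can be obtained, for instance, with unnest_tokens from the tidytext package. This is a sketch on two lines taken from the dialogues table; whether the original analysis removes stop words at this stage is an assumption.

```r
library(dplyr)
library(tidytext)

dialogues <- data.frame(
  line_number = 1:2,
  role = c("Bart Simpson", "Lisa Simpson"),
  line = c("Victory party under the slide!", "That life is worth living."),
  stringsAsFactors = FALSE
)

# One row per word; words are lowercased and punctuation is dropped.
tidy_dialogues <- dialogues %>%
  unnest_tokens(word, line)

# Optionally remove common stop words such as "the" (assumed step).
tidy_no_stop <- tidy_dialogues %>%
  anti_join(stop_words, by = "word")
```

The other columns (line_number, role) are repeated on every row, so each word can still be traced back to its line and speaker.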

 

Characters

The Simpsons is known for its vast ensemble of leading and supporting characters. The characters data set collects the names of 6,722 characters that appeared throughout the seasons. In 95% of the cases, the gender of the character is not recorded. However, this is not problematic, since those characters are not particularly relevant to the development of the show: they account for only about 15% of all dialogue.

Contribution to dialogues

How each gender contributes to the dialogues of 26 seasons
Gender     Frequency   Proportion
Male       842,670     63.9%
Female     274,362     20.8%
Unknown    202,416     15.3%
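
As a quick check of the table above, the proportions follow directly from the frequencies. The snippet uses only the three counts reported in the table; the variable names are illustrative.

```r
# Word counts by gender, copied from the table above.
words_by_gender <- data.frame(
  gender    = c("Male", "Female", "Unknown"),
  frequency = c(842670, 274362, 202416)
)

# Share of each gender in the total, as a percentage with one decimal.
words_by_gender$proportion <-
  round(100 * words_by_gender$frequency / sum(words_by_gender$frequency), 1)

words_by_gender
```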

The Simpsons is characterized by a marked gender imbalance, which is also reflected in the show’s writing staff. More than 75% of the characters with recorded gender are male. The only female leading characters are Marge and Lisa Simpson. Among the supporting cast, we find Edna Krabappel-Flanders (the teacher at Springfield Elementary School) and the twins Selma and Patty Bouvier (Marge’s older sisters).

Let us look at how the total number of words of the top four leading characters (Homer, Marge, Bart, and Lisa Simpson) has evolved throughout the seasons.

   

Locations

The Simpsons is mainly set in Springfield, a fictional town acting as a universe in which the characters can explore the issues faced by modern society. Although the locations data set reports 4,459 distinct settings, most of the dialogue actually takes place in far fewer places.

Let us have a look at the 20 most common locations, that is, the settings where the characters had the most dialogue. At the top, we find the Simpson home, followed by Springfield Elementary School and Moe’s Tavern. The majority of these locations are indoor settings.

 

It would be interesting to investigate which characters have the most dialogue within each of these locations. Let us consider for simplicity the top six locations.

  • Simpson home: it is the house of the Simpson family. Of course, the most talkative characters here are the Simpson family members, granddad included.

  • Springfield Elementary School: it is the local school on The Simpsons, attended by Bart and Lisa Simpson. Besides the two of them, the other leading characters include the principal Skinner, the superintendent Chalmers and the teacher Edna.

  • Moe’s Tavern: it is the local bar in Springfield. The dialogue here mostly occurs between the owner Moe and his guests Homer, Lenny, Carl, and Barney.

  • Springfield Nuclear Power Plant: it is the nuclear power plant in Springfield. The leading characters here are Mr. Burns, who owns the plant, his executive assistant Smithers, and the employees Homer, Lenny, and Carl.

  • Kwik-E-Mart: it is the convenience store run by Apu. The dialogues here usually involve the owner of the store and the members of the Simpson family.

  • First Church of Springfield: it is the main religious house in Springfield. Most of the dialogues here occur between the Reverend, the Simpson family, and their very religious next-door neighbor Ned.
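
The two rankings discussed in this section (most common locations, then most talkative characters within each of them) could be computed along these lines. This is a sketch on toy data that counts lines; whether the original analysis counts lines or words is not stated, so that choice is an assumption.

```r
library(dplyr)

# Toy dialogue lines: one row per line, with its location and speaker.
dialogues <- data.frame(
  location = c("Simpson Home", "Simpson Home", "Simpson Home",
               "Moe's Tavern", "Moe's Tavern", "Train Station"),
  role = c("Homer Simpson", "Homer Simpson", "Marge Simpson",
           "Moe Szyslak", "Moe Szyslak", "Conductor"),
  stringsAsFactors = FALSE
)

# Most common locations, by number of lines (top 20 in the text).
top_locations <- dialogues %>%
  count(location, sort = TRUE, name = "lines") %>%
  head(20)

# Most talkative characters within each of the top six locations.
top_speakers <- dialogues %>%
  semi_join(head(top_locations, 6), by = "location") %>%
  count(location, role, name = "lines") %>%
  group_by(location) %>%
  slice_max(lines, n = 5) %>%
  ungroup()
```

Note that slice_max keeps ties by default, so a location can return more than five characters when several of them share the same count.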

   

Ratings and TV views

The episodes data set contains interesting information on the IMDb (Internet Movie Database) rating of each episode on a 1 - 10 scale, the number of votes it received, and the number of TV views in the United States.

Let us have a look at the ratings, the number of votes, and the TV views across episodes, as well as averaged by season.

Seasons

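# Line plot of a per-season summary statistic.
# Assumes ggplot2, dplyr (%>%, enquo), scales (comma, unit_format), and gridExtra
# (grid.arrange) are attached, and that `colors` is a palette vector defined
# elsewhere in the analysis (it is not shown in this section).
line_plot <- function(data, x, y, xlab = "Season", ylab, title, sub,
                      limits = NULL, breaks = NULL, labels = comma){

  # Capture the unquoted column names so they can be unquoted inside aes()
  x <- enquo(x)
  y <- enquo(y)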

  p <- data %>%
    ggplot(aes(!! x, !! y)) +
    geom_line(size = 1.2, color = "#8d99ae") +
    geom_point(shape=21, color=colors[1], fill=colors[1], size=1) +
    scale_x_continuous(breaks = seq(1, 27, 4)) +
    labs(x = xlab, y = ylab, title = title, subtitle = sub) +
    theme(plot.subtitle=element_text(size=9))
    if(!is.null(breaks)){
     p <- p +
       scale_y_continuous(labels = labels, breaks = breaks, limits = limits)
    }else{
     p <- p +
       scale_y_continuous(labels = labels, limits = limits)
    }
  p
}

episodes_byseason <- episodes %>%
  group_by(season) %>%
  summarize(avg_rate  = mean(rating),
            avg_vote  = mean(votes),
            avg_views = mean(us_views))

plot.rating <- episodes_byseason %>%
  line_plot(season, avg_rate, ylab = "Rating score",  title = "IMDb ratings",
            sub = "Averaged by season \nOlder seasons are the most appreciated.",
            breaks = seq(0, 10, 2), limits = c(1,10))

plot.vote <- episodes_byseason %>%
  line_plot(season, avg_vote, ylab = "Number of votes", title = "IMDb votes",
            sub = "Averaged by season \nOlder seasons are the most rated.",
            limits = c(0, 2000))

plot.view <- episodes_byseason %>%
  line_plot(season, avg_views, ylab = "Number of US viewers", title = "TV views in the US",
            sub = "Averaged by season \nOlder seasons are the most viewed.",
            labels = unit_format(unit = "", scale = 1e+6, big.mark = ","),
            limits = c(0, 30))

grid.arrange(plot.rating, plot.vote, plot.view, nrow = 1, widths = c(0.95, 0.97, 1.08))

We notice an overall downward trend for all three indicators. Older seasons seem to be the most appreciated, most rated, and most viewed. As a matter of fact, The Simpsons show received acclaim throughout its first nine or ten seasons, which are generally considered its “Golden Age”, but has been criticized for a perceived decline in quality over the years.

To be fair, the availability in recent years of a variety of network channels and Internet streaming platforms has caused a systematic drop in the number of TV views not only for The Simpsons but also for many other shows.

Let us look at the pairwise relationships between IMDb ratings, IMDb votes, and TV views. There seem to be positive linear relationships between each pair, suggesting that when one of those indicators increases, so do the other two.

The correlation matrix of episodes certainly shows strong inverse relationships between the episode number and the ratings, the votes, and the views. This indicates that more recent seasons generally have lower values of those performance indicators.
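
A correlation matrix of this kind can be obtained with base R's cor. The sketch below uses made-up values that decline over time, purely to show the mechanics; the choice of Pearson correlation is an assumption, since the method used in the original analysis is not stated.

```r
# Hypothetical episode-level data with indicators declining over time.
episodes <- data.frame(
  episode_id = 1:6,
  rating     = c(8.5, 8.3, 8.0, 7.6, 7.1, 6.9),
  votes      = c(1500, 1400, 1200, 900, 700, 650),
  us_views   = c(30.3, 26.1, 20.2, 15.1, 9.8, 6.5)
)

# Pairwise correlations between all numeric columns.
cor_matrix <- cor(episodes, method = "pearson")
round(cor_matrix, 2)
```

In this toy example, the episode_id row is negative against every indicator, while the indicators are positively correlated with each other, which is the pattern described in the text.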

   

Guest stars

In addition to the show’s regular cast of voice actors, celebrity guest stars have been a staple of The Simpsons since its first season. Guest voices have come from a wide range of professions, including actors, athletes, authors, musicians, artists, politicians, and scientists.

The guests data set contains information about the guest stars that took part in every episode and their role. Let us investigate the most recurring guest voices and roles.

Frequent guest stars

The most frequent guest stars across 30 seasons have been Marcia Wallace, Phil Hartman, and Maurice LaMarche. Whereas Marcia Wallace has almost always played the teacher Edna, the other two stars have actually voiced numerous characters on the show.

Top 5 guest stars and the roles they played
Guest star Roles they played Number of appearances
Marcia Wallace Edna Krabappel-Flanders; Ms. Melon; Mrs. Krabapatra 175
Phil Hartman Lionel Hutz; Heaven; Troy McClure; Moses; Cable guy; Plato; Joey; Godfather; Horst; Stockbroker; Smooth Jimmy Apollo; Lyle Lanley; Security Guard; Mandy Patinkin; Tom; Eddie Muntz; Evan Conover; Charlton Heston; Fat Tony; Hospital chairman; Bill Clinton 73
Maurice LaMarche George C. Scott; Hannibal Lecter; Captain James T. Kirk; Eudora Welty; Commander McBragg; Orson Welles; Recruiter #2; Cap’n Crunch; First Mate Billy; Oceanographer; Farmer; Horn Stuffer; Fox announcer; Government Official; Jock; Toucan Sam; Trix Rabbit; Dwight D. Eisenhower; City Inspector; Nuclear Power Plant Guard; David Starsky; Anthony Hopkins; Charlie Sheen; Prepper; Chef Naziwa; Karl Malden; John Kerry; Milo; Football Commentator; Clive Meriwether; Neil Simon; Rodney Dangerfield; Morbo; Hedonismbot; Lrrr 38
Joe Mantegna Fat Tony; Himself playing Fat Tony; Fit Tony 30
Jon Lovitz Artie Ziff; Mr. Seckofsky; Professor Lombardo; Aristotle Amadopolis; Mr. Devaro; Llewellyn Sinclair; Ms. Sinclair; Jay Sherman; Llewelyn Sinclair; Aristotle Amadopoulis; Enrico Irritazio; Cigarette; Himself; Hacky; Snitchy the Weasel; Rabbi 28
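
A table like the one above can be built by grouping the guests data by guest star. The sketch uses four rows copied from the guests table shown earlier; counting one appearance per row (rather than per distinct episode) is an assumption.

```r
library(dplyr)

# Four rows copied from the guests table shown earlier.
guests <- data.frame(
  number = c("002\u2013102", "002\u2013102", "003\u2013103", "018\u2013205"),
  guest_star = c("Marcia Wallace", "Marcia Wallace", "Marcia Wallace", "Tony Bennett"),
  role = c("Edna Krabappel-Flanders", "Ms. Melon", "Edna Krabappel-Flanders", "Himself"),
  stringsAsFactors = FALSE
)

top_guests <- guests %>%
  group_by(guest_star) %>%
  summarize(roles = paste(unique(role), collapse = "; "),  # distinct roles
            appearances = n()) %>%                          # one per row
  arrange(desc(appearances)) %>%
  head(5)
```

Replacing n() with n_distinct(number) would count episodes instead of episode and role combinations.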

Frequent roles

The most frequent roles played by guest stars are either themselves or some supporting characters, such as the teacher Edna, the gangster Fat Tony, and the actor Troy McClure.

Top 10 guest star roles
Role Frequency
Himself 336
Edna Krabappel-Flanders 173
Herself 59
Fat Tony 29
Troy McClure 29
Lionel Hutz 25
Sideshow Bob 21
Themselves 19
Rabbi Hyman Krustofsky 11
Mona Simpson 9

Who are the guest stars who played themselves in multiple episodes? At the top, we find the physicist and cosmologist Stephen Hawking with 4 appearances across 30 seasons, followed by the comic-book writer Stan Lee, the filmmaker Ken Burns, and the actor Gary Coleman, each with 3 appearances. The gender imbalance in the original characters is also reflected in the guest appearances, with just 5 women playing themselves twice in 30 seasons. The majority of the guest stars, however, appear in only a single episode.

It would be interesting to see whether the number of guest appearances has changed over time. In the show’s early years, most guest stars voiced original characters, but as the show has continued, the number of those appearing as themselves has increased, especially throughout seasons 12 to 19. In more recent seasons, the gap between the two conditions seems to have become more pronounced.

When guest stars voice a character, how much do they talk? Let us explore the average number of lines given to guest stars. The guest stars with the most lines per episode are the ones voicing a narrator or an announcer, and they are usually not playing themselves. The only guest star who played themselves and still had a fair number of lines is Lady Gaga.

Top 15 guest stars with the highest number of lines per episode
Guest star Role Playing themselves Number of episodes Number of lines Number of lines per episode
Larry McKay Announcer FALSE 1 386 386
Matt Groening Announcer FALSE 1 386 386
Phil Hartman Fat Tony FALSE 1 276 276
Clarence Clemons Narrator FALSE 1 156 156
Daniel Stern Narrator FALSE 1 156 156
George Fenneman Narrator FALSE 1 156 156
Jim Forbes Narrator FALSE 1 156 156
Ken Burns Narrator FALSE 1 156 156
Marc Wilmore Narrator FALSE 1 156 156
Matt Dillon Louie FALSE 1 104 104
Greg Berg Eddie FALSE 1 96 96
James Earl Jones Narrator FALSE 2 156 78
Lady Gaga Lady Gaga TRUE 1 78 78
Kristen Wiig Annie Crawford FALSE 1 74 74
Steve Carell Dan Gillick FALSE 1 64 64
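
The last column of the table is simply the number of lines divided by the number of episodes. A small check on three rows copied from the table (variable names are illustrative):

```r
# Three rows copied from the table above.
guest_lines <- data.frame(
  guest_star = c("James Earl Jones", "Lady Gaga", "Steve Carell"),
  n_episodes = c(2, 1, 1),
  n_lines    = c(156, 78, 64),
  stringsAsFactors = FALSE
)

guest_lines$lines_per_episode <- guest_lines$n_lines / guest_lines$n_episodes
guest_lines
```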

Guest stars playing themselves tend to have fewer lines than those playing an actual character on the show.

Is there a difference in IMDb ratings, IMDb votes, and TV views between episodes with guest stars playing themselves and episodes with guest stars playing an original character? To some extent. Episodes with guests playing themselves have, on average, lower ratings, votes, and views than episodes with guests voicing an original character. According to the two-sample Wilcoxon rank sum test, the differences between the two groups are statistically significant at the 5% level.
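
The comparison described above can be sketched with base R's wilcox.test. The ratings below are made up to illustrate the call; the actual analysis is run on the joined episodes and guests data, and the exact options it uses are not shown in the source.

```r
# Hypothetical ratings for episodes whose guest stars do (TRUE) or do not
# (FALSE) play themselves.
episodes <- data.frame(
  self   = rep(c(FALSE, TRUE), each = 5),
  rating = c(8.1, 8.3, 8.5, 8.7, 8.9, 7.0, 7.2, 7.4, 7.6, 7.8)
)

# Group means, as reported in the summary table.
means <- tapply(episodes$rating, episodes$self, mean)

# Two-sample Wilcoxon rank sum test; exact = FALSE uses the normal
# approximation, which avoids warnings when real data contain ties.
test <- wilcox.test(rating ~ self, data = episodes, exact = FALSE)
test$p.value
```

The same call can be repeated with votes and us_views as the response to fill in the remaining columns of the table.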

Average performance indicators and p-values of Wilcoxon rank sum test
Guest star                      IMDb rating   IMDb votes   TV views in US (millions)
Playing an original character   7.384         836.552      11.763
Playing themselves              7.275         777.821      11.216
p-value                         0.007         0.037        0.003
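
# Violin plot of an indicator, split by whether the guest star plays themselves.
# Assumes ggplot2, dplyr, scales, and ggpubr (ggarrange) are attached, and that
# `colors` is a palette vector defined elsewhere in the analysis.
plot_violin <- function(df, x, y, ylab, title = "", limits = NULL,
                        breaks = NULL, labels = comma){

  # Capture the unquoted column names so they can be unquoted inside aes()
  x <- enquo(x)
  y <- enquo(y)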

  data_summary <- function(x) {
    m <- mean(x)
    ymin <- m-sd(x)
    ymax <- m+sd(x)
    return(c(y=m,ymin=ymin,ymax=ymax))
  }

  p <- df %>%
    filter(!is.na(!! x)) %>%
    group_by(!! x) %>%
    ggplot(aes(!! x, !! y, fill = !! x)) +
    geom_violin() +
    scale_fill_manual(name = "Guest star", values = colors[3:4],
                      labels = c("FALSE" = "Playing an original character", "TRUE" = "Playing themselves")) +
    scale_x_discrete(labels = c("FALSE" = "", "TRUE" = "")) +
    labs(x = "", y = ylab, title = title)  +
    stat_summary(fun.data = data_summary, geom = "pointrange", color = "black",
                 show.legend = FALSE)

  if(!is.null(breaks)){
    p <- p +
      scale_y_continuous(labels = labels, breaks = breaks, limits = limits)
  }else{
    p <- p +
      scale_y_continuous(labels = labels, limits = limits)
  }
  p
}

plot.guest.ratings <- episodes %>%
  plot_violin(x = self, y = rating, ylab = "Rating score", title = "IMDb ratings",
              limits = c(1, 10), breaks = seq(0, 10, 2))

plot.guest.votes <- episodes %>%
  plot_violin(x = self, y = votes, ylab = "Number of votes", title = "IMDb votes",
              labels = comma, limits = c(0, 4000))

plot.guest.views <- episodes %>%
  plot_violin(x = self, y = us_views, ylab = "Number of US viewers", title = "TV views in the US",
              labels = unit_format(unit = "", scale = 1e+6, big.mark = ","),
              limits = c(0, 35))

ggarrange(plot.guest.ratings, plot.guest.votes, plot.guest.views, nrow = 1, common.legend = TRUE,
             legend="bottom", widths = c(0.95, 0.97, 1.08))

   

Text analysis

Let us now carry out some text analysis on dialogues. In this scenario, each character's dialogue acts as one document of the corpus.

Word frequency

Frequent words

Let us have a look at the most frequent words. By keeping only distinct combinations of role, word, and line number, we avoid counting the same word from the same line multiple times. After removing the stop words, the most recurrent words seem to be those the characters use to address each other.

Top 15 most frequent words
Character Word Gender Frequency
Homer Simpson marge Male 1,752
Marge Simpson homer Female 1,319
Lisa Simpson dad Female 1,076
Homer Simpson hey Male 933
Bart Simpson dad Male 876
Lisa Simpson bart Female 708
Homer Simpson gonna Male 694
Homer Simpson yeah Male 691
Bart Simpson hey Male 662
Lisa Simpson mom Female 612
Homer Simpson uh Male 607
Homer Simpson boy Male 583
Marge Simpson bart Female 570
Homer Simpson time Male 558
Marge Simpson homie Female 525

Peculiar words

Let us compute the term frequency (tf), the inverse document frequency (idf), and the tf-idf. The latter looks for the most important words in each document that are not too common in other documents. In our case, this means finding the words that are peculiar to a particular character, but generally not to other characters.

We can use the tf-idf as a catchphrase detector. Specifically, we look at the characters with a fair amount of dialogue (i.e., more than 500 words) and keep one row per character, so as to find one peculiar word for every role. For some characters, the peculiar word is the name of the character they usually talk to (e.g., Smithers saying ‘sir’ or Agnes Skinner saying ‘Seymour’). For others, it is either the phrase they use to introduce themselves (e.g., Troy McClure saying ‘I’m Troy McClure’) or a recurring sound (e.g., the Captain going ‘Arrr’ or Nelson going ‘haw’).
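A minimal sketch of this catchphrase detector is given below, using bind_tf_idf() from tidytext. The counts in toy_counts are made up, and the threshold min_words is a toy value standing in for the 500 words mentioned above.

```r
library(dplyr)
library(tidytext)

# Made-up word counts per character; each character acts as one document
toy_counts <- tibble::tribble(
  ~role,          ~word,  ~n,
  "Nelson Muntz", "haw",  8,
  "Nelson Muntz", "hey",  2,
  "Moe Szyslak",  "hey",  5,
  "Moe Szyslak",  "beer", 5
)

min_words <- 5  # toy threshold (the analysis above uses 500)

toy_tf_idf <- toy_counts %>%
  bind_tf_idf(term = word, document = role, n = n)  # adds tf, idf, tf_idf

peculiar_words <- toy_tf_idf %>%
  group_by(role) %>%
  filter(sum(n) > min_words) %>%                    # talkative characters only
  slice_max(tf_idf, n = 1, with_ties = FALSE) %>%   # one row per character
  ungroup()

peculiar_words
```

Here ‘hey’ is used by both characters, so its idf (and hence its tf-idf) is zero, while ‘haw’ and ‘beer’ each appear in a single document and come out as the peculiar words.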

Bigrams Analysis

Let us now focus on the bigrams, that is, the pairs of words that often occur together.
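As a rough illustration, bigrams can be extracted with unnest_tokens() by setting token = "ngrams" and n = 2. The toy_lines data frame is invented, and dropping the bigrams that contain a stop word is an assumption about the pre-processing rather than a step confirmed by the analysis.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Invented lines (hypothetical column names)
toy_lines <- tibble::tibble(
  role = c("Homer Simpson", "Homer Simpson", "Nelson Muntz"),
  text = c("Woo hoo! Woo hoo!", "I am Homer Simpson", "Haw haw!")
)

bigram_counts <- toy_lines %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # consecutive word pairs
  filter(!is.na(bigram)) %>%                                # lines with < 2 words
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,                       # assumed stop-word filter
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  count(bigram, sort = TRUE, name = "frequency")

bigram_counts
```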

Frequent bigrams

The most recurrent bigrams concern the members of the Simpson family (e.g., ‘homer simpson’, ‘bart simpson’, ‘lisa simpson’) or some onomatopoeia (e.g., ‘woo hoo’, ‘hey hey’, ‘la la’).

Top 10 bigrams throughout 26 seasons
Bigram Frequency
homer simpson 461
woo hoo 360
hey hey 311
la la 268
bart simpson 258
heh heh 221
ha ha 215
uh huh 210
haw haw 184
lisa simpson 174

Peculiar bigrams

The peculiar bigrams are those with the largest tf-idf among the bigrams that occur over 50 times. At the top, we find ‘haw haw’, the signature mocking laugh of Nelson, and ‘kent brockman’, since the TV announcer typically opens with ‘This is Kent Brockman’.

Peculiar bigrams occurring over 50 times throughout 26 seasons
Character Bigram Frequency tf idf tf-idf
Nelson Muntz haw haw 133 0.13 4.83 0.62
Kent Brockman kent brockman 70 0.02 5.45 0.13
Krusty the Clown hey hey 80 0.03 4.09 0.13
Moe Szyslak hey hey 57 0.02 4.09 0.06
Homer Simpson woo hoo 311 0.01 5.15 0.06
Bart Simpson hey dad 55 0.00 6.50 0.03

Networks

The relationships across the bigrams can be depicted through a network plot. To keep the plot readable, we consider bigrams that occurred at least 30 times. The nodes from which most of the arrows are departing seem to be ‘simpson’, ‘dollars’, and ‘hoo’. All in all, the most common bigrams seem to refer to character names, locations, or onomatopoeia.
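One possible way to build such a network is to turn the filtered bigram counts into a directed graph with igraph. This is only a sketch: the counts for ‘million dollars’ and ‘duff beer’ are invented, and the edge direction (first word to second word) is an assumption.

```r
library(igraph)

# Bigram counts split into their two words; the last two rows are invented
toy_bigrams <- data.frame(
  word1 = c("homer", "bart", "lisa", "woo", "million", "duff"),
  word2 = c("simpson", "simpson", "simpson", "hoo", "dollars", "beer"),
  n     = c(461, 258, 174, 360, 40, 12)
)

# Keep the bigrams occurring at least 30 times; each row becomes an edge
# from word1 to word2, with n stored as an edge attribute
bigram_graph <- graph_from_data_frame(subset(toy_bigrams, n >= 30),
                                      directed = TRUE)

# Number of edges touching each node, regardless of direction
node_degree <- sort(degree(bigram_graph, mode = "all"), decreasing = TRUE)
node_degree
```

The resulting graph object can then be handed to a plotting package such as ggraph to draw the nodes and arrows.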

Sentiment Analysis

Let us carry out a sentiment analysis to explore the feelings that emerge from The Simpsons dialogues. We are using the ‘bing’ lexicon, which attributes a positive or negative valence to every word in its vocabulary.
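In practice, attaching the ‘bing’ valence to the words amounts to an inner join with the lexicon, as in the sketch below. The words in toy_words are made up; words that are absent from the lexicon (such as ‘donut’) simply drop out of the result.

```r
library(dplyr)
library(tidytext)

# Made-up tokens, one row per word occurrence
toy_words <- tibble::tibble(
  word = c("love", "love", "bad", "stupid", "nice", "donut")
)

sentiment_counts <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep lexicon words only
  count(word, sentiment, sort = TRUE)                  # frequency per word and valence

sentiment_counts
```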

Which words contribute the most to the positive and negative sentiments?

Jointly

To get more insightful results, we only consider the words occurring at least 400 times. Among the positive words we find, e.g., ‘love’, ‘wow’, ‘nice’, and ‘fine’, whereas among the negative ones we find ‘bad’, ‘burns’, ‘stupid’, and ‘kill’. The word ‘burns’ is associated with a negative sentiment because the lexicon treats it as a form of the verb ‘to burn’, which has a negative connotation. In The Simpsons, though, ‘burns’ most likely refers to the character Mr. Burns. Still, given the evil and greedy nature of the character, the graph seems pretty accurate after all!

Per character

The words most responsible for the positive (e.g., ‘love’, ‘nice’, ‘fine’) and negative sentiments (e.g., ‘bad’, ‘wrong’) do not seem to depend that much on the character.

Word clouds

Let us now have a look at some word clouds, which give us an idea of the most recurrent words throughout the seasons. To get more insightful results, we remove some onomatopoeia from dialogues.tidy.

Per character

Let us depict a separate word cloud for the most talkative characters of the show, to get an insight into their most recurrent ‘themes’. The most recurrent words for each character are the ones used for interacting with the other characters on the show. As we would expect, we discover a broad theme revolving around school and friends for Bart and Lisa Simpson, around the bar for Moe, and the nuclear plant for Mr. Burns.

Topic Modelling

We conclude this analysis with some topic modelling. To this end, we need to construct the document term matrix of dialogues. It turns out that considering the show episodes as documents does not provide much insight, as each episode touches on various topics. Therefore, we treat all the lines spoken by a given character as one document.

To get more meaningful results, when constructing the document term matrix, we only keep the words occurring at least 10 times, and whose tf-idf is higher than the 70% quantile.

Let us perform a Latent Dirichlet Allocation (LDA) analysis on the document term matrix. We allow a large number of hidden topics, say eight.
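The steps above can be sketched on a tiny example with cast_dtm() from tidytext and LDA() from topicmodels. The word counts are invented, only two topics are fitted instead of eight, and the frequency and tf-idf filters are omitted for brevity.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Invented word counts; each character is one document
toy_counts <- tibble::tribble(
  ~role,                     ~word,      ~n,
  "Moe Szyslak",             "beer",     10,
  "Moe Szyslak",             "bar",       8,
  "Barney Gumble",           "beer",      9,
  "Barney Gumble",           "bar",       4,
  "Seymour Skinner",         "school",   12,
  "Seymour Skinner",         "children",  7,
  "Edna Krabappel-Flanders", "school",    6,
  "Edna Krabappel-Flanders", "children",  9
)

# Document term matrix: one row per character, one column per word
toy_dtm <- toy_counts %>%
  cast_dtm(document = role, term = word, value = n)

# Fit the topic model; the seed makes the fit reproducible
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# beta: probability of each word within a topic
word_topics <- tidy(toy_lda, matrix = "beta")
# gamma: probability of each character belonging to a topic
character_topics <- tidy(toy_lda, matrix = "gamma")
```

Sorting word_topics by beta within each topic gives the most representative words, while sorting character_topics by gamma gives the characters most strongly associated with each topic.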

The plots below show the 10 most representative words for each topic, and the four characters with the highest probabilities of belonging to each topic. Inspecting them will allow us to give a meaning to the topics and find the underlying traits shared by groups of characters.

Topics Interpretation
1 It seems to reflect the social life of Homer Simpson, as indicated by the words ‘moe’, ‘beer’, and ‘money’. Among the characters assigned to this topic, we find the bar owner Moe, and Homer’s friends Lenny, Barney, and Carl. Ned Flanders is also allocated here, which explains the presence of words like ‘god’, ‘love’, and ‘lord’.
2 The words ‘sir’ and ‘chief’ seem to suggest employer-employee relationships. The top characters allocated to this topic include Smithers, Mr. Burns, Lou, Eddie, and Chief Wiggum. Smithers is Burns’ trusted personal assistant, whereas Lou and Eddie are the two police officers who aid Chief Wiggum on almost every mission.
3 It seems to revolve around school, as demonstrated by the words ‘children’, ‘willie’, ‘edna’, and ‘principal’. As we could expect, the top characters for this topic are all connected to Springfield Elementary School. They include the principal Skinner, the teachers Edna Krabappel-Flanders and Miss Hoover, the superintendent Chalmers, and the weird student Ralph Wiggum.
4 The meaningful words of this topic (‘homie’, ‘honey’, and ‘husband’) touch on family and marital aspects. Interestingly enough, we find not only Marge, but also Mona Simpson (Homer’s mother), Selma (Marge’s sister), and Manjula (Apu’s wife).
5 The words ‘lisa’ and ‘milhouse’ possibly hint at kid-related topics. The top characters include Lisa, Milhouse, young Carl, and young Lenny.
6 We read ‘marge’, ‘boy’, ‘bart’, ‘lisa’, ‘stupid’, and ‘flanders’, and we can’t help thinking about Homer’s world! Homer is, of course, the main character of this topic, followed by other secondary characters.
7 We can be confident that the words ‘cool’, ‘krusty’, ‘boys’, and ‘milhouse’ revolve around the social life of Bart Simpson. Besides Bart, the other main characters are his bully friends Nelson, Jimbo, and Kearney, the recidivist criminal Snake Jailbird, and the drug-addicted bus driver Otto Mann. What a crew!
8 The words ‘live’, ‘story’, and ‘coming’ allude to the theme of news and TV coverage. Local news often reports on Krusty the Clown, which is why the top two words refer to him. The main characters belonging to this topic are the narrator, some announcers, the news reporter Kent Brockman, and the sea Captain Horatio McCallister.

Conclusion

This analysis gave us some pretty interesting insights into The Simpsons! We discovered the most popular characters and the locations where they usually interact. We then analyzed the ratings, votes, and views of the episodes across 27 seasons. We also found the most recurring guest stars, and evaluated whether their presence had any impact on the number of lines or ratings. Thanks to the scripts’ availability, we carried out some text analysis that, among other things, allowed us to inspect the underlying sentiments and the main topics of the show.